Speech representation
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Neural Information Processing Systems

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
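As a concrete illustration of the pretrain-then-fine-tune recipe this abstract describes, the sketch below loads a pretrained wav2vec 2.0 encoder through torchaudio's published pipeline and attaches an illustrative task head. The audio path and the head's output size are placeholders for this example, not details from the paper.

```python
# Minimal sketch: extract wav2vec 2.0 representations with torchaudio's
# pretrained pipeline, then attach a task head as a stand-in for the
# fine-tuning stage. "utterance.wav" and the 29-class head are placeholders.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE           # pretrained, no ASR head
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")       # (channels, time)
if sr != bundle.sample_rate:                          # resample to 16 kHz
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    features, _ = model.extract_features(waveform)    # list of per-layer outputs
last_layer = features[-1]                             # (batch, frames, 768)

# Fine-tuning sketch: a trainable head on top of the learned representations,
# e.g. frame-level character logits for CTC-style speech recognition.
head = torch.nn.Linear(last_layer.size(-1), 29)       # hypothetical vocab size
logits = head(last_layer)
```

In the paper's setup the encoder itself is also updated during fine-tuning on transcribed speech; freezing it, as implied above, is only the simplest variant.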


SENSE models: an open source solution for multilingual and multimodal semantic-based tasks

Mdhaffar, Salima, Elleuch, Haroun, Chellaf, Chaimae, Nguyen, Ha, Estève, Yannick

arXiv.org Artificial Intelligence

This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution inspired by the SAMU-XLSR framework and conceptually similar to Meta AI's SONAR models. These approaches rely on a teacher-student framework to align a self-supervised speech encoder with the language-agnostic continuous representations of a text encoder at the utterance level. We describe how the original SAMU-XLSR method has been updated by selecting a stronger teacher text model and a better initial speech encoder. The source code for training and using SENSE models has been integrated into the SpeechBrain toolkit, and the first SENSE model we trained has been publicly released. We report experimental results on multilingual and multimodal semantic tasks, where our SENSE model achieves highly competitive performance. Finally, this study offers new insights into how semantics are captured in such semantically aligned speech encoders.

Speech foundation models based on self-supervised learning (SSL) have brought significant advances in speech processing. These models, such as wav2vec 2.0 [1], HuBERT [2], and WavLM [3], generate learned speech representations that can be applied to a wide range of downstream speech processing tasks. By training on large amounts of unlabelled speech data, SSL models have demonstrated the ability to capture crucial speech features, such as phonemes and other acoustic units [4]. This capability has led to significant progress in multiple downstream tasks, including speech recognition [1], speech translation [5], speech separation, speaker verification, speaker diarization [3], and emotion detection [6]. Different approaches have been proposed to pretrain models by aligning speech and text, such as mSLAM [7], a massively multilingual joint pre-training approach for speech and text.
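A hedged sketch of the SAMU-XLSR-style teacher-student objective described above: pool the speech encoder's frame-level outputs into one utterance vector and pull it toward a frozen multilingual text embedding of the transcript. `speech_frames`, `text_embedding`, and `projector` are generic placeholders, not names from the SENSE or SpeechBrain code; only the alignment loss is shown.

```python
# Teacher-student utterance alignment, simplified. The student is an SSL
# speech encoder; the teacher is a frozen multilingual sentence embedder.
import torch
import torch.nn.functional as F

def alignment_loss(speech_frames, text_embedding, projector):
    """speech_frames: (B, T, D_s) student frame outputs.
    text_embedding: (B, D_t) frozen teacher sentence embeddings.
    projector: nn.Linear(D_s, D_t) mapping student space to teacher space."""
    utt = speech_frames.mean(dim=1)      # mean pooling for brevity; SAMU-XLSR
    utt = projector(utt)                 # uses learned attention pooling
    # Cosine distance between student and (detached) teacher embeddings.
    return 1.0 - F.cosine_similarity(utt, text_embedding.detach(), dim=-1).mean()
```

Detaching the teacher embedding reflects the key design choice of this family of methods: only the speech side moves, so the shared space stays anchored to the language-agnostic text encoder.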


JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention

Ioannides, Georgios, Constantinou, Christos, Chadha, Aman, Elkins, Aaron, Pang, Linsey, Shwartz-Ziv, Ravid, LeCun, Yann

arXiv.org Artificial Intelligence

We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
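To make the Stage 2 tokenization concrete, here is an illustrative sketch of Finite Scalar Quantization (FSQ) with mixed-radix packing as the abstract describes. The per-dimension level counts are assumptions chosen for the example (odd counts keep the rounding symmetric); the paper's exact configuration may differ.

```python
# FSQ: bound each latent dimension with tanh, round it to a small fixed set
# of levels, and pass gradients through with a straight-through estimator.
# Mixed-radix packing then folds the per-dim codes into one token id.
import torch

LEVELS = [7, 5, 5, 5]  # hypothetical levels per latent dimension

def fsq(z):
    """Quantize z (..., len(LEVELS)) dimension-wise to its finite level set."""
    levels = torch.tensor(LEVELS, dtype=z.dtype, device=z.device)
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half                   # e.g. [-3, 3] for 7 levels
    quantized = torch.round(bounded)                 # integer code per dim
    return bounded + (quantized - bounded).detach()  # straight-through gradient

def pack_mixed_radix(codes):
    """Fold per-dim codes into one id: c0 + c1*L0 + c2*L0*L1 + ..."""
    token, base = torch.zeros_like(codes[..., 0]), 1
    for i, L in enumerate(LEVELS):
        token = token + (codes[..., i] + (L - 1) // 2) * base  # shift to >= 0
        base *= L
    return token.long()
```

With these assumed levels the packed vocabulary has 7 * 5 * 5 * 5 = 875 entries, which is the sense in which FSQ yields a compact, language-model-friendly token inventory without a learned codebook.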


Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs

Tseng, Wei-Cheng, Harwath, David

arXiv.org Artificial Intelligence

Recent advancements in neural audio codecs have not only enabled superior audio compression but also enhanced speech synthesis techniques. Researchers are now exploring their potential as universal acoustic feature extractors for a broader range of speech processing tasks. Building on this trend, we introduce Codec2Vec, the first speech representation learning framework that relies exclusively on discrete audio codec units. This approach offers several advantages, including improved data storage and transmission efficiency, faster training, and enhanced data privacy. We explore masked prediction with various training target derivation strategies to thoroughly understand the effectiveness of this framework. Evaluated on the SUPERB benchmark, Codec2Vec achieves competitive performance compared to continuous-input models while reducing storage requirements by up to 16.5× and training time by 2.3×, showcasing its scalability and efficiency.

Over the past several years, the speech processing community has rapidly adopted self-supervised learning (SSL) followed by supervised fine-tuning as a general-purpose modeling approach for tasks ranging from automatic speech recognition and emotion recognition to speaker verification [1]-[3].
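The masked-prediction idea behind this framework can be sketched as BERT-style pretraining over discrete codec unit ids. The sketch below assumes a single codebook stream whose training target is simply the original unit at each masked position; real codecs emit several parallel streams, and the paper explores other target derivation strategies. All sizes are illustrative.

```python
# Masked prediction over discrete codec units, BERT-style. VOCAB, MASK_ID,
# model depth, and mask_prob are illustrative choices, not the paper's.
import torch
import torch.nn as nn

VOCAB, MASK_ID, D = 1024, 1024, 256   # codec codebook size + extra [MASK] id

class MaskedCodecPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, D)           # +1 for [MASK]
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D, VOCAB)                   # predict original unit

    def forward(self, units, mask_prob=0.15):
        """units: (B, T) long tensor of codec unit ids."""
        mask = torch.rand(units.shape, device=units.device) < mask_prob
        corrupted = units.masked_fill(mask, MASK_ID)
        logits = self.head(self.encoder(self.embed(corrupted)))
        # Compute the loss only at masked positions, as in masked LM training.
        return nn.functional.cross_entropy(logits[mask], units[mask])
```

Because the inputs are small integer ids rather than waveforms or spectrograms, this setup is what yields the storage and training-time savings the abstract reports.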